Accessing Data

As pandas is built on Python, any means available in Python can be used to retrieve data from an outside source. This makes the range of data that can be accessed essentially unlimited, including text files, Excel spreadsheets, web sites and services, databases, and cloud-based services.

Setting up the Python notebook


In [1]:
# import pandas and numpy
import numpy as np
import pandas as pd

# set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)

CSV & Text/Tabular Format


In [2]:
# view the first five lines of data/msft.csv
! head -n 5 ../../data/msft.csv # OSX/Linux
# !type ..\..\data\msft.csv     # on windows


Date,Open,High,Low,Close,Volume,Adj Close
2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35
2014-07-17,84.35,84.63,83.33,83.63,1974000,83.63
2014-07-16,83.77,84.91,83.66,84.91,1755600,84.91

Reading a CSV file into DataFrame


In [3]:
# read in msft.csv into a DataFrame
msft = pd.read_csv("../../data/msft.csv")
msft.head()


Out[3]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
3  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
4  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58

In [4]:
# specifying the index column
msft = pd.read_csv("../../data/msft.csv", index_col=0)
msft.head()


Out[4]:
             Open   High    Low  Close   Volume  Adj Close
Date                                                      
2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58

The Date field is now the index; however, because of this, it is no longer column data. If you want to use the date as a column, you will need to create a new column and assign the index labels to that column.
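As a minimal sketch (using a small in-memory frame in place of the msft data), the index labels can be copied into a column, or reset_index() can be used to move the index back into a column:

```python
import pandas as pd

# a small frame indexed by date, standing in for the msft data above
msft = pd.DataFrame(
    {'Close': [81.93, 83.35]},
    index=pd.Index(['2014-07-21', '2014-07-18'], name='Date'))

# copy the index labels into a regular column
msft['Date'] = msft.index

# alternatively, reset_index() moves the index back into a column
flat = msft[['Close']].reset_index()
```

Either approach gives you the dates as ordinary column data again.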


In [5]:
# examine the types of the columns in the DataFrame
msft.dtypes


Out[5]:
Open         float64
High         float64
Low          float64
Close        float64
Volume         int64
Adj Close    float64
dtype: object

In [6]:
# to force the type of a column, use the dtype parameter
# the following forces the Volume column to be float64
msft = pd.read_csv("../../data/msft.csv", dtype={'Volume': np.float64})
msft.dtypes


Out[6]:
Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume       float64
Adj Close    float64
dtype: object

Specifying column names


In [7]:
# specify a new set of names for the columns
# all lower case, remove space in Adj Close
# header=0 reads the header row but replaces its values with names;
# with six names given for seven columns, the first column (Date)
# becomes the index
df = pd.read_csv("../../data/msft.csv", header=0,
                 names=['open','high','low','close','volume','adjclose'])
df.head()


Out[7]:
             open   high    low  close   volume  adjclose
2014-07-21  83.46  83.53  81.81  81.93  2359300     81.93
2014-07-18  83.30  83.40  82.52  83.35  4020800     83.35
2014-07-17  84.35  84.63  83.33  83.63  1974000     83.63
2014-07-16  83.77  84.91  83.66  84.91  1755600     84.91
2014-07-15  84.30  84.38  83.20  83.58  1874700     83.58

In [9]:
# read in data only in the Date and Close columns,
# use Date as the index
df2 = pd.read_csv("../../data/msft.csv",usecols=['Date','Close'],index_col=['Date'])
df2.head()


Out[9]:
            Close
Date             
2014-07-21  81.93
2014-07-18  83.35
2014-07-17  83.63
2014-07-16  84.91
2014-07-15  83.58

Saving DataFrame to a CSV file


In [10]:
# save df2 to a new csv file
# also specify naming the index as date
df2.to_csv("../../data/msft_modified.csv",index_label='date')

It was necessary to tell the method that the index label should be saved with a column name of date using index_label='date'. Otherwise, the index does not have a name in the first row of the file, which makes it difficult to read back properly.
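A sketch of reading the saved file back, with an in-memory string standing in for msft_modified.csv; because the index was named date, index_col and parse_dates restore it directly:

```python
import io
import pandas as pd

# stand-in for the contents of msft_modified.csv saved above
csv_text = "date,Close\n2014-07-21,81.93\n2014-07-18,83.35\n"

# because the index was saved under the name 'date', it can be
# restored directly as the index and parsed back into datetimes
df2 = pd.read_csv(io.StringIO(csv_text), index_col='date',
                  parse_dates=True)
```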


In [11]:
# view the start of the file just saved
!head ../../data/msft_modified.csv


date,Close
2014-07-21,81.93
2014-07-18,83.35
2014-07-17,83.63
2014-07-16,84.91
2014-07-15,83.58
2014-07-14,84.4
2014-07-11,83.35
2014-07-10,83.42
2014-07-09,85.5

General Field-Delimited Data


In [12]:
# use read_table with sep=',' to read a csv
df = pd.read_table("../../data/msft.csv", sep=',')
df.head()


Out[12]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
3  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
4  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58

In [13]:
# save as pipe delimited
df.to_csv("../../data/msft_piped.txt",sep='|')
# check if it worked
!head -n 5 ../../data/msft_piped.txt


|Date|Open|High|Low|Close|Volume|Adj Close
0|2014-07-21|83.46|83.53|81.81|81.93|2359300|81.93
1|2014-07-18|83.3|83.4|82.52|83.35|4020800|83.35
2|2014-07-17|84.35|84.63|83.33|83.63|1974000|83.63
3|2014-07-16|83.77|84.91|83.66|84.91|1755600|84.91

Handling noisy rows in a dataset

Sometimes, data in a field-delimited file may contain extraneous headers and footers. Examples include company information at the top, such as an invoice number, addresses, and summary footers. Sometimes data is stored on every other line. These situations will cause errors when pandas tries to read the files. To handle these scenarios, some useful parameters can be used.
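For the every-other-line case, skiprows also accepts a callable in more recent pandas versions (0.19 onward). A minimal sketch, with an in-memory string standing in for a hypothetical file:

```python
import io
import pandas as pd

# hypothetical file where noise lines alternate with data lines
text = ("Date,Close\n"
        "junk\n"
        "2014-07-21,81.93\n"
        "junk\n"
        "2014-07-18,83.35\n")

# the callable is passed each line number and returns True to
# skip that line; here, the odd-numbered lines carry the noise
df = pd.read_csv(io.StringIO(text), skiprows=lambda n: n % 2 == 1)
```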


In [14]:
# messy file
!head ../../data/msft2.csv  # Linux


This is fun because the data does not start on the first line
Date,Open,High,Low,Close,Volume,Adj Close

And there is space between the header row and data
2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35
2014-07-17,84.35,84.63,83.33,83.63,1974000,83.63
2014-07-16,83.77,84.91,83.66,84.91,1755600,84.91
2014-07-15,84.30,84.38,83.20,83.58,1874700,83.58
2014-07-14,83.66,84.64,83.11,84.40,1432100,84.40

In [15]:
# read, but skip rows 0,2 and 3
df = pd.read_csv("../../data/msft2.csv",skiprows=[0,2,3])
df


Out[15]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
3  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
4  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58
5  2014-07-14  83.66  84.64  83.11  84.40  1432100      84.40
6  2014-07-11  83.55  83.98  82.85  83.35  2001400      83.35
7  2014-07-10  85.20  85.57  83.36  83.42  2713300      83.42
8  2014-07-09  84.83  85.79  84.76  85.50  1540700      85.50

Another common situation is a file with content at the end that should be ignored to prevent an error, such as the following:


In [16]:
# another messy file with mess at the end
!cat ../../data/msft_with_footer.csv # osx / Linux


Date,Open,High,Low,Close,Volume,Adj Close
2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35

Uh oh, there is stuff at the end.

In [20]:
# skip the last two lines of the file
# the engine parameter forces the python implementation, since
# skipfooter is not supported by the default c implementation
df = pd.read_csv("../../data/msft_with_footer.csv",skipfooter=2,engine='python')
df


Out[20]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35

In [21]:
# only process the first three rows
pd.read_csv("../../data/msft.csv",nrows=3)


Out[21]:
         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63

In [22]:
# skip 100 lines, then only process the next five
pd.read_csv("../../data/msft.csv", skiprows=100, nrows=5,
            header=0, names=['open','high','low','close','vol','adjclose'])


Out[22]:
             open   high    low  close      vol  adjclose
2014-03-03  80.35  81.31  79.91  79.97  5004100     77.40
2014-02-28  82.40  83.42  82.17  83.42  2853200     80.74
2014-02-27  84.06  84.63  81.63  82.00  3676800     79.36
2014-02-26  82.92  84.03  82.43  83.81  2623600     81.12
2014-02-25  83.80  83.80  81.72  83.08  3579100     80.41

Reading and Writing data in Excel Format

pandas supports reading data in Excel 2003 and newer formats using the pd.read_excel() function or via the ExcelFile class.


In [23]:
# read excel file
# only reads first sheet
df = pd.read_excel("../../data/stocks.xlsx")
df.head()


Out[23]:
        Date   Open   High    Low  Close   Volume  Adj Close
0 2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1 2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2 2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
3 2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
4 2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58

In [26]:
# read from the aapl worksheet
aapl = pd.read_excel("../../data/stocks.xlsx", sheetname='aapl')
aapl.head()


Out[26]:
        Date   Open   High    Low  Close    Volume  Adj Close
0 2014-07-21  94.99  95.00  93.72  93.94  38887700      93.94
1 2014-07-18  93.62  94.74  93.02  94.43  49898600      94.43
2 2014-07-17  95.03  95.28  92.57  93.09  57152000      93.09
3 2014-07-16  96.97  97.10  94.74  94.78  53396300      94.78
4 2014-07-15  96.80  96.85  95.03  95.32  45477900      95.32

In [27]:
# save to excel file in worksheet sheet1
df.to_excel("../../data/stocks2.xlsx")

In [28]:
# write making the worksheet name MSFT
df.to_excel("../../data/stocks_msft.xlsx", sheet_name='MSFT')

To write more than one DataFrame to a single Excel file, with each DataFrame object on a separate worksheet, use the ExcelWriter object along with the with keyword.


In [29]:
from pandas import ExcelWriter
with ExcelWriter("../../data/all_stocks.xls") as writer:
    aapl.to_excel(writer,sheet_name='AAPL')
    df.to_excel(writer,sheet_name='MSFT')

In [30]:
# write to xlsx
df.to_excel("../../data/msft2.xlsx")

Reading and Writing JSON files

pandas can read and write data stored in JSON format.


In [31]:
# write the excel data to a JSON file
df.head().to_json("../../data/stocks.json")
!cat ../../data/stocks.json


{"Date":{"0":1405900800000,"1":1405641600000,"2":1405555200000,"3":1405468800000,"4":1405382400000},"Open":{"0":83.46,"1":83.3,"2":84.35,"3":83.77,"4":84.3},"High":{"0":83.53,"1":83.4,"2":84.63,"3":84.91,"4":84.38},"Low":{"0":81.81,"1":82.52,"2":83.33,"3":83.66,"4":83.2},"Close":{"0":81.93,"1":83.35,"2":83.63,"3":84.91,"4":83.58},"Volume":{"0":2359300,"1":4020800,"2":1974000,"3":1755600,"4":1874700},"Adj Close":{"0":81.93,"1":83.35,"2":83.63,"3":84.91,"4":83.58}}

In [32]:
# read data in from JSON
df_from_json = pd.read_json("../../data/stocks.json")
df_from_json.head(5)


Out[32]:
   Adj Close  Close       Date   High    Low   Open   Volume
0      81.93  81.93 2014-07-21  83.53  81.81  83.46  2359300
1      83.35  83.35 2014-07-18  83.40  82.52  83.30  4020800
2      83.63  83.63 2014-07-17  84.63  83.33  84.35  1974000
3      84.91  84.91 2014-07-16  84.91  83.66  83.77  1755600
4      83.58  83.58 2014-07-15  84.38  83.20  84.30  1874700

Notice two slight differences here caused by the round trip through JSON. First, the columns have been reordered alphabetically. Second, the index for the DataFrame, although containing content, is sorted as a string.
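If preserving the column order matters, the orient parameter of to_json() and read_json() can help; orient='split' stores columns, index, and data separately. A small sketch:

```python
import io
import pandas as pd

df = pd.DataFrame({'b': [1, 2], 'a': [3, 4]})

# the default orientation loses column order on a round trip;
# orient='split' records columns, index and data separately,
# preserving the original ordering
json_text = df.to_json(orient='split')
restored = pd.read_json(io.StringIO(json_text), orient='split')
```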

Reading HTML data from the web

Under the covers, pandas makes use of the LXML, html5lib, and BeautifulSoup4 packages.


In [33]:
# url to read
url = "http://www.fdic.gov/bank/individual/failed/banklist.html"
# read it
banks = pd.read_html(url)
# examine a subset of the first table read
banks[0][0:5].iloc[:,0:4]


Out[33]:
                                           Bank Name                City  ST  \
0    The Farmers and Merchants State Bank of Argonia             Argonia  KS   
1                                Fayette County Bank          Saint Elmo  IL   
2  Guaranty Bank, (d/b/a BestBank in Georgia & Mi...           Milwaukee  WI   
3                                     First NBC Bank         New Orleans  LA   
4                                      Proficio Bank  Cottonwood Heights  UT   

    CERT  
0  17719  
1   1802  
2  30003  
3  58302  
4  35495  

In [35]:
# write to html
# read the stock data
df=pd.read_excel("../../data/stocks.xlsx")
# write first 2 rows to HTML
df.head(2).to_html("../../data/stocks.html")
# check
!head -n 28 ../../data/stocks.html


<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Date</th>
      <th>Open</th>
      <th>High</th>
      <th>Low</th>
      <th>Close</th>
      <th>Volume</th>
      <th>Adj Close</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2014-07-21</td>
      <td>83.46</td>
      <td>83.53</td>
      <td>81.81</td>
      <td>81.93</td>
      <td>2359300</td>
      <td>81.93</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2014-07-18</td>
      <td>83.30</td>

Reading and Writing HDF5 format files

HDF5 is a data model, library and file format to store and manage data. It is commonly used in scientific computing environments. It supports an unlimited variety of data types and is designed for flexible and efficient I/O and for high volume and complex data.

HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5. The HDF5 technology suite includes tools and applications to manage, manipulate, view, and analyze data in HDF5 format.

HDF5 is:

  • A versatile data model that can represent very complex data objects and a wide variety of metadata
  • A completely portable file format with no limit on the number or size of data objects in a collection
  • A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran, and Java interfaces
  • A rich set of integrated performance features that allow for access-time and storage-space optimizations
  • Tools and applications to manage, manipulate, view, and analyze the data in a collection

HDFStore is a hierarchical, dictionary-like object that reads and writes pandas objects to the HDF5 format.


In [36]:
# seed for replication
np.random.seed(123456)
# create a DataFrame of dates and random numbers in three columns
df = pd.DataFrame(np.random.randn(8,3),index=pd.date_range('1/1/2000', periods=8), columns=['A','B','C'])
# create HDF5 store
store = pd.HDFStore('../../data/store.h5')
store['df'] = df # persisting happened here
store


Out[36]:
<class 'pandas.io.pytables.HDFStore'>
File path: ../../data/store.h5
/df            frame        (shape->[8,3])

In [37]:
# read in data from HDF5
store = pd.HDFStore("../../data/store.h5")
df = store['df']
df


Out[37]:
                   A         B         C
2000-01-01  0.469112 -0.282863 -1.509059
2000-01-02 -1.135632  1.212112 -0.173215
2000-01-03  0.119209 -1.044236 -0.861849
2000-01-04 -2.104569 -0.494929  1.071804
2000-01-05  0.721555 -0.706771 -1.039575
2000-01-06  0.271860 -0.424972  0.567020
2000-01-07  0.276232 -1.087401 -0.673690
2000-01-08  0.113648 -1.478427  0.524988

In [38]:
# this changes the DataFrame, but does not persist it
df.loc[df.index[0], 'A'] = 1

# to persist the change, assign the dataframe to the
# HDF5 store object

store['df'] = df
# it is now persisted

# the following loads the store and
# shows the first two rows, demonstrating
# the persisting was done
pd.HDFStore("../../data/store.h5")['df'].head(2)


Out[38]:
                   A         B         C
2000-01-01  1.000000 -0.282863 -1.509059
2000-01-02 -1.135632  1.212112 -0.173215

Accessing data on the web and in the cloud

pandas makes it extremely easy to read data from the web and the cloud. All of the pandas functions we have examined so far can also be given an HTTP URL, FTP address or S3 address instead of a local file path.


In [44]:
# read a CSV directly from a URL (the AirPassengers dataset)
df = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv")
df[:5]


Out[44]:
   Unnamed: 0         time  AirPassengers
0           1  1949.000000            112
1           2  1949.083333            118
2           3  1949.166667            132
3           4  1949.250000            129
4           5  1949.333333            121

Reading and writing from/to SQL databases

pandas can read data from any SQL database that supports a Python data adapter respecting the Python DB-API. Reading is performed using the pandas.io.sql.read_sql() function, and writing to SQL databases is done using the .to_sql() method of DataFrame.


In [42]:
# reference SQLITE
import sqlite3

# read in the stock data from csv
msft = pd.read_csv("../../data/msft.csv")
msft['Symbol'] = "MSFT"
aapl = pd.read_csv("../../data/aapl.csv")
aapl['Symbol'] = 'AAPL'

# create connection
connection = sqlite3.connect("../../data/stocks.sqlite")
# .to_sql() will create sql to store the DataFrame
# in the specified table. if_exists specifies
# what to do if the table already exists

msft.to_sql("STOCK_DATA", connection, if_exists="replace")
aapl.to_sql("STOCK_DATA", connection, if_exists="append")

# commit the sql and close the connection
connection.commit()
connection.close()


/Users/cnc/anaconda/lib/python3.6/site-packages/pandas/core/generic.py:1345: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  chunksize=chunksize, dtype=dtype)

In [43]:
# read data
# connect to the database file
connection = sqlite3.connect("../../data/stocks.sqlite")

# query all records in STOCK_DATA
# returns a DataFrame
# index_col specifies which column to make the DataFrame index
stocks = pd.io.sql.read_sql("SELECT * FROM STOCK_DATA;", connection, index_col="index")

# close the connection
connection.close()

# report the head of the data received
stocks.head()


Out[43]:
             Date   Open   High    Low  Close   Volume  Adj Close Symbol
index                                                                   
0      2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93   MSFT
1      2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35   MSFT
2      2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63   MSFT
3      2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91   MSFT
4      2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58   MSFT

In [46]:
# open the connection
connection = sqlite3.connect("../../data/stocks.sqlite")

# construct the query string
query = "SELECT * FROM STOCK_DATA WHERE Volume > 29200100 AND Symbol='MSFT';"

# execute and close connection
items = pd.io.sql.read_sql(query,connection,index_col='index')
connection.close()

items


Out[46]:
             Date   Open   High    Low  Close    Volume  Adj Close Symbol
index                                                                    
1081   2010-05-21  42.22  42.35  40.99  42.00  33610800      36.48   MSFT
1097   2010-04-29  46.80  46.95  44.65  45.92  47076200      38.41   MSFT
1826   2007-06-15  89.80  92.10  89.55  92.04  30656400      35.87   MSFT
3455   2001-03-16  47.00  47.80  46.10  45.33  40806400      17.66   MSFT
3712   2000-03-17  49.50  50.00  48.29  50.00  50860500      19.48   MSFT

As these functions take a connection object, which can be any Python DB-API-compatible data adapter, you can work with more or less any supported database simply by creating an appropriate connection object. The code at the pandas level should remain the same for any supported database.
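As a small illustration of that portability (using an in-memory SQLite database in place of the file-based one used above), the pandas-level calls are unchanged:

```python
import sqlite3
import pandas as pd

# any DB-API compatible connection works; an in-memory SQLite
# database stands in here for the file-based one used above
connection = sqlite3.connect(":memory:")

stocks = pd.DataFrame({'Symbol': ['MSFT', 'AAPL'],
                       'Close': [81.93, 93.94]})
stocks.to_sql("STOCK_DATA", connection, if_exists="replace")

# the pandas-level code is identical whatever database backs
# the connection object
back = pd.read_sql("SELECT * FROM STOCK_DATA;", connection,
                   index_col="index")
connection.close()
```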

Reading data from remote data services

pandas has direct support for various web-based data sources through the pandas_datareader package. The primary class of interest is DataReader, which is implemented to read data from the various supported sources and return it to the application directly as a DataFrame.

Currently, support exists for the following sources via the DataReader class:

  • Daily historical stock prices from either Yahoo! or Google Finance
  • Yahoo! Options
  • Federal Reserve Economic Data Library
  • Kenneth French's Data Library
  • The World Bank

Reading Stock Data from Yahoo! and Google Finance


In [54]:
import pandas_datareader.data as web
import datetime

# start and end dates
start = datetime.datetime(2012,1,1)
end = datetime.datetime(2014,1,27)

# read the MSFT stock data from Yahoo!
yahoo = web.DataReader('MSFT','yahoo',start,end)
yahoo.head()


Out[54]:
                 Open       High        Low      Close  Adj Close    Volume
Date                                                                       
2012-01-03  26.549999  26.959999  26.389999  26.770000  22.909807  64731500
2012-01-04  26.820000  27.469999  26.780001  27.400000  23.448965  80516100
2012-01-05  27.379999  27.730000  27.290001  27.680000  23.688589  56081400
2012-01-06  27.530001  28.190001  27.530001  28.110001  24.056585  99455500
2012-01-09  28.049999  28.100000  27.719999  27.740000  23.739935  59706800

In [55]:
# read from google
google = web.DataReader('MSFT','google',start,end)
google.head()


Out[55]:
             Open   High    Low  Close    Volume
Date                                            
2016-11-08  60.55  60.78  60.15  60.47  22935355
2016-11-09  60.00  60.59  59.20  60.17  49632479
2016-11-10  60.48  60.49  57.63  58.70  57822394
2016-11-11  58.23  59.12  58.01  59.02  38767843
2016-11-14  59.02  59.08  57.28  58.12  41328422

In [57]:
# specify we want all yahoo options data for AAPL
# this can take a little time...
from pandas_datareader.data import Options
aapl = Options('AAPL','yahoo')
# read all the data
data = aapl.get_all_data()
# examine the first six rows and four columns
data.iloc[0:6,0:4]


Out[57]:
                                              Last     Bid     Ask   Chg
Strike Expiry     Type Symbol                                           
2.5    2017-11-17 call AAPL171117C00002500  160.00  159.40  161.50  6.75
       2017-12-15 call AAPL171215C00002500  154.95  160.15  161.25  0.00
       2018-01-19 call AAPL180119C00002500  152.55  153.30  154.05  0.00
                  put  AAPL180119P00002500    0.02    0.00    0.02  0.00
       2018-04-20 put  AAPL180420P00002500    0.01    0.00    0.01  0.00
5.0    2017-11-17 call AAPL171117C00005000  151.80  150.80  151.60  0.00

In [59]:
# get all puts at strike price of $80 (first four columns only)
data.loc[(80, slice(None),'put'),:].iloc[0:5,0:4]


Out[59]:
                                            Last   Bid   Ask   Chg
Strike Expiry     Type Symbol                                     
80.0   2017-11-17 put  AAPL171117P00080000  0.01  0.00  0.05  0.00
       2017-12-15 put  AAPL171215P00080000  0.01  0.00  0.01  0.00
       2018-01-19 put  AAPL180119P00080000  0.01  0.00  0.02 -0.01
       2018-02-16 put  AAPL180216P00080000  0.02  0.00  0.03 -0.01
       2018-04-20 put  AAPL180420P00080000  0.10  0.06  0.14  0.00

In [61]:
# puts at strike $80 with expiry between 2015-01-17 and 2015-04-17
# (none exist in the current data, hence the empty result)
data.loc[(80,slice('20150117','20150417'),'put'),:].iloc[:,0:4]


Out[61]:
Empty DataFrame
Columns: [Last, Bid, Ask, Chg]
Index: []

In [64]:
# msft calls expiring on 2015-01-05
expiry = datetime.date(2015, 1, 5)
msft_calls = Options('MSFT','yahoo').get_call_data(expiry=expiry)
msft_calls.iloc[0:5,0:5]


Out[64]:
                                             Last    Bid    Ask  Chg  PctChg
Strike Expiry     Type Symbol                                               
60.0   2017-11-10 call MSFT171110C00060000  15.05  18.75  18.90  0.0     0.0
65.0   2017-11-10 call MSFT171110C00065000  13.85  17.85  19.40  0.0     0.0
67.0   2017-11-10 call MSFT171110C00067000  17.25   0.00   0.00  0.0     0.0
67.5   2017-11-10 call MSFT171110C00067500   9.05  11.30  11.45  0.0     0.0
68.0   2017-11-10 call MSFT171110C00068000   7.80  10.80  10.95  0.0     0.0

In [65]:
# aapl calls expiring on 2015-01-17
expiry = datetime.date(2015,1,17)
aapl_calls = aapl.get_call_data(expiry=expiry)
aapl_calls.iloc[0:5,0:4]


Out[65]:
                                             Last   Bid    Ask       Chg
Strike Expiry     Type Symbol                                           
105.0  2017-11-10 call AAPL171110C00105000  69.85   0.0   0.00  0.000000
110.0  2017-11-10 call AAPL171110C00110000  62.35   0.0   0.00  0.000000
115.0  2017-11-10 call AAPL171110C00115000  59.85   0.0   0.00  0.000000
120.0  2017-11-10 call AAPL171110C00120000  51.98  52.2  52.75  8.849998
125.0  2017-11-10 call AAPL171110C00125000  32.35  37.2  39.05  0.000000

Reading from Federal Reserve Economic Data


In [66]:
gdp = web.DataReader("GDP","fred",datetime.date(2012,1,1),datetime.date(2014,1,27))
gdp


Out[66]:
                  GDP
DATE                 
2012-01-01  15973.881
2012-04-01  16121.851
2012-07-01  16227.939
2012-10-01  16297.349
2013-01-01  16475.440
2013-04-01  16541.390
2013-07-01  16749.349
2013-10-01  16999.888
2014-01-01  17031.324

In [67]:
# get compensation of employees: Wages and Salaries
web.DataReader("A576RC1A027NBEA","fred",datetime.date(1929,1,1),datetime.date(2013,1,1))


Out[67]:
            A576RC1A027NBEA
DATE                       
1929-01-01             50.5
1930-01-01             46.2
1931-01-01             39.2
1932-01-01             30.5
1933-01-01             29.0
...                     ...
2009-01-01           6251.4
2010-01-01           6377.5
2011-01-01           6633.2
2012-01-01           6930.3
2013-01-01           7114.4

[85 rows x 1 columns]

Accessing Kenneth French's Data

Kenneth R. French is a professor of finance at the Tuck School of Business at Dartmouth College. He has created an extensive library of economic data, which is available for download over the web.


In [68]:
# read from Kenneth French fama global factors data set
factors = web.DataReader("Global_Factors","famafrench")
factors


Out[68]:
{0:          Mkt-RF   SMB   HML   WML    RF
 Date                                   
 2010-01   -3.70  2.70 -0.29 -2.23  0.00
 2010-02    1.24  0.14  0.10  1.59  0.00
 2010-03    6.30 -0.26  3.18  4.26  0.01
 2010-04    0.44  3.78  0.77  1.60  0.01
 2010-05   -9.52  0.17 -2.54 -0.56  0.01
 ...         ...   ...   ...   ...   ...
 2015-09   -3.91 -0.28 -0.89  3.47  0.00
 2015-10    7.30 -2.26  0.21 -2.62  0.00
 2015-11   -0.30  1.69 -1.78  2.11  0.00
 2015-12   -1.74  0.93 -1.79  3.28  0.01
 2016-01   -6.29 -2.10  0.97  0.51  0.01
 
 [73 rows x 5 columns], 1:       Mkt-RF    SMB    HML    WML    RF
 Date                                   
 2010   13.94  12.62  -5.14  13.93  0.12
 2011   -6.79  -5.41  -4.76   6.30  0.04
 2012   16.87  -2.55   6.41   6.35  0.06
 2013   28.61  -0.46   5.07  23.68  0.02
 2014    3.30  -5.03  -3.88   0.79  0.02
 2015   -0.50   3.86 -11.12  17.10  0.02, 'DESCR': 'Global Factors\n--------------\n\nThis file was created using the 201601 Bloomberg database. Missing data are indicated by -99.99. \n\n  0 : (73 rows x 5 cols)\n  1 : Annual Factors: January-December (6 rows x 5 cols)'}

Reading from the World Bank

World Bank datasets are identified by indicators, a text code that represents each dataset. A full list of indicators can be retrieved using the wb.get_indicators() function from pandas_datareader.


In [76]:
from pandas_datareader import wb
all_indicators = wb.get_indicators()
# examine some of the indicators
all_indicators.iloc[:,0:1]


Out[76]:
                               id
0              1.0.HCount.1.90usd
1               1.0.HCount.2.5usd
2            1.0.HCount.Mid10to50
3                 1.0.HCount.Ofcl
4             1.0.HCount.Poor4uds
...                           ...
16931    per_sionl.overlap_q1_rur
16932    per_sionl.overlap_q1_tot
16933    per_sionl.overlap_q1_urb
16934     s_policyholders_B2_life
16935  s_policyholders_B2_nonlife

[16936 rows x 1 columns]

In [77]:
# search for life expectancy indicators
le_indicators = wb.search("life expectancy")
le_indicators.iloc[:3,:2]


Out[77]:
                      id                                               name
9011         SE.SCH.LIFE  School life expectancy, primary to tertiary, b...
10312  SP.DYN.LE00.FE.IN           Life expectancy at birth, female (years)
10313     SP.DYN.LE00.IN            Life expectancy at birth, total (years)

In [78]:
# get the countries and their codes
countries = wb.get_countries()
# show a subset of the country data
countries.iloc[0:10][['name','capitalcity','iso2c']]


Out[78]:
                   name  capitalcity iso2c
0                 Aruba          NaN    AW
1           Afghanistan          NaN    AF
2                Africa          NaN    A9
3                Angola          NaN    AO
4               Albania          NaN    AL
5               Andorra          NaN    AD
6         Andean Region          NaN    L5
7            Arab World          NaN    1A
8  United Arab Emirates          NaN    AE
9             Argentina          NaN    AR

In [79]:
# get life expectancy at birth for all countries from 1980 to 2014
le_data_all = wb.download(indicator="SP.DYN.LE00.IN", start='1980',end='2014')
le_data_all


Out[79]:
                    SP.DYN.LE00.IN
country       year                
Canada        2014       81.953049
              2013       81.772049
              2012       81.583512
              2011       81.448780
              2010       81.197561
...                            ...
United States 1984       74.563415
              1983       74.463415
              1982       74.360976
              1981       74.009756
              1980       73.609756

[105 rows x 1 columns]

In [80]:
# only US, CAN and MEX are returned by default
le_data_all.index.levels[0]


Out[80]:
Index(['Canada', 'Mexico', 'United States'], dtype='object', name='country')

In [81]:
# retrieve life expectancy at birth for all countries
# from 1980 to 2012
le_data_all = wb.download(indicator="SP.DYN.LE00.IN",country=countries['iso2c'],start='1980',end='2012')
le_data_all


/Users/cnc/anaconda/lib/python3.6/site-packages/pandas_datareader/wb.py:145: UserWarning: Non-standard ISO country codes: 1A, 1W, 4E, 6D, 6F, 6L, 6N, 6X, 7E, 8S, A4, A5, A9, B1, B2, B3, B4, B6, B7, B8, C4, C5, C6, C7, C8, C9, D2, D3, D4, D5, D6, D7, D8, D9, EU, F1, F6, JG, L4, L5, L6, L7, M1, M2, N6, O6, OE, R6, S1, S2, S3, S4, T2, T3, T4, T5, T6, T7, V1, V2, V3, V4, XC, XD, XE, XF, XG, XH, XI, XJ, XK, XL, XM, XN, XO, XP, XQ, XT, XU, XY, Z4, Z7, ZB, ZF, ZG, ZJ, ZQ, ZT
  'country codes: %s' % tmp, UserWarning)
Out[81]:
               SP.DYN.LE00.IN
country  year                
Aruba    2012       75.183195
         2011       75.047659
         2010       74.910634
         2009       74.771146
         2008       74.627683
...                       ...
Zimbabwe 1984       60.932756
         1983       60.711585
         1982       60.350878
         1981       59.886049
         1980       59.355561

[8679 rows x 1 columns]
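The frame returned by `wb.download` is indexed by a `(country, year)` MultiIndex, with one column per indicator code. As a minimal sketch, the same shape can be built by hand (the values below are invented for illustration, not World Bank data):

```python
import pandas as pd

# mimic the shape of a wb.download() result: a (country, year)
# MultiIndex on the rows and one column per indicator code
index = pd.MultiIndex.from_product(
    [['Canada', 'Mexico'], ['2012', '2011']],
    names=['country', 'year'])
le_sketch = pd.DataFrame({'SP.DYN.LE00.IN': [81.6, 81.4, 76.9, 76.7]},
                         index=index)
print(le_sketch)
# the distinct countries live in level 0 of the index
print(list(le_sketch.index.levels[0]))
```

This is why `le_data_all.index.levels[0]` above lists the unique country names: `levels[0]` holds the distinct values of the first index level.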

We can do some interesting things with this data. The following example determines which country has the lowest life expectancy in each year. To do this, we first need to pivot the data so that the index is the country name and the columns are the years.


In [82]:
# pivot() operates on columns, not index levels, so first move
# country and year out of the index with reset_index()
le_data = le_data_all.reset_index().pivot(index='country', columns='year')
# examine the first three columns of the pivoted data
le_data.iloc[:, 0:3]


Out[82]:
                   SP.DYN.LE00.IN                      
year                         1980       1981       1982
country                                                
Afghanistan             41.875244  42.533610  43.235902
Albania                 70.235976  70.454463  70.685122
Algeria                 58.164024  59.486756  60.786341
American Samoa                NaN        NaN        NaN
Andorra                       NaN        NaN        NaN
...                           ...        ...        ...
West Bank and Gaza            NaN        NaN        NaN
World                   62.865800  63.200694  63.519298
Yemen, Rep.             50.559537  51.541341  52.492707
Zambia                  51.248293  50.943171  50.514366
Zimbabwe                59.355561  59.886049  60.350878

[263 rows x 3 columns]
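The `reset_index()`/`pivot()` step can be seen in isolation on a tiny invented frame (names and values here are made up purely for illustration):

```python
import pandas as pd

# a small stand-in for le_data_all: (country, year) MultiIndex, one column
index = pd.MultiIndex.from_product(
    [['A', 'B'], ['1980', '1981']], names=['country', 'year'])
df = pd.DataFrame({'SP.DYN.LE00.IN': [41.8, 42.5, 70.2, 70.4]}, index=index)

# pivot() needs ordinary columns, so move the index levels back out first
pivoted = df.reset_index().pivot(index='country', columns='year')
print(pivoted)
# the columns are now an (indicator, year) MultiIndex
```

Note that calling `.pivot()` directly on `le_data_all` would fail, because `country` and `year` are index levels rather than columns; `reset_index()` converts them back into columns first.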

In [83]:
# for each year, find the name of the country
# with the lowest life expectancy
country_with_least_expectancy = le_data.idxmin(axis=0)
country_with_least_expectancy


Out[83]:
                year
SP.DYN.LE00.IN  1980                    Cambodia
                1981                    Cambodia
                1982                 Timor-Leste
                1983                 South Sudan
                1984                 South Sudan
                                  ...           
                2008    Central African Republic
                2009    Central African Republic
                2010    Central African Republic
                2011    Central African Republic
                2012    Central African Republic
Length: 33, dtype: object

In [84]:
# and what is the minimum life expectancy for each year
expectancy_for_least_country = le_data.min(axis=0)
expectancy_for_least_country


Out[84]:
                year
SP.DYN.LE00.IN  1980    27.738976
                1981    33.449927
                1982    38.186220
                1983    39.666488
                1984    39.999537
                          ...    
                2008    46.163171
                2009    46.834927
                2010    47.532707
                2011    48.256976
                2012    49.012146
Length: 33, dtype: float64

In [85]:
# combine the two Series into a single DataFrame indexed by year,
# giving both the country with the lowest life expectancy and its value
least = pd.DataFrame(data={'Country': country_with_least_expectancy.values,
                           'Expectancy': expectancy_for_least_country.values},
                     index=country_with_least_expectancy.index.levels[1])
least


Out[85]:
                       Country  Expectancy
year                                      
1980                  Cambodia   27.738976
1981                  Cambodia   33.449927
1982               Timor-Leste   38.186220
1983               South Sudan   39.666488
1984               South Sudan   39.999537
...                        ...         ...
2008  Central African Republic   46.163171
2009  Central African Republic   46.834927
2010  Central African Republic   47.532707
2011  Central African Republic   48.256976
2012  Central African Republic   49.012146

[33 rows x 2 columns]
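Putting the last three steps together, here is a self-contained sketch of the whole computation on invented data (the country names and numbers below are illustrative only, not the World Bank values):

```python
import pandas as pd

# stand-in for the pivoted frame: countries as rows,
# an (indicator, year) MultiIndex as columns
cols = pd.MultiIndex.from_product([['LE'], ['1980', '1981']],
                                  names=[None, 'year'])
pivoted = pd.DataFrame([[41.8, 42.5],
                        [70.2, 70.4],
                        [59.3, 59.8]],
                       index=pd.Index(['Afghanistan', 'Albania', 'Zimbabwe'],
                                      name='country'),
                       columns=cols)

country_min = pivoted.idxmin(axis=0)  # country name per (indicator, year)
value_min = pivoted.min(axis=0)       # minimum value per (indicator, year)

# combine into one frame indexed by year; levels[1] holds the years
least = pd.DataFrame({'Country': country_min.values,
                      'Expectancy': value_min.values},
                     index=country_min.index.levels[1])
print(least)
```

Using `index.levels[1]` for the final index works here because there is a single indicator, so the year level of the column MultiIndex lines up one-to-one with the values of the two Series.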

In [ ]: